Certified Machine Learning Associate Databricks Exam Questions and Answers

Question 1

Which of the following machine learning algorithms typically uses bagging?

A. Gradient boosted trees
B. K-means
C. Random forest
D. Linear regression
E. Decision tree

Answer : C
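
Bagging (bootstrap aggregating) trains each base learner on a bootstrap sample of the data and combines their predictions by voting or averaging. The sketch below illustrates only that idea, using one-feature threshold classifiers ("stumps") as a stand-in for trees. A real random forest also samples a subset of features at each split, which is omitted here. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-feature threshold classifier: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def fit_bagged_stumps(points, n_estimators=25, seed=0):
    """Train each stump on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [rng.choice(points) for _ in points]
        stumps.append(fit_stump(sample))
    return stumps


def predict(stumps, x):
    """Aggregate the individual stump predictions by majority vote."""
    votes = Counter(int(x > threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): class 0 for small x, class 1 for large x.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
ensemble = fit_bagged_stumps(data)
print(predict(ensemble, 0.0), predict(ensemble, 5.0))
```

Gradient boosted trees, by contrast, fit learners sequentially on the errors of the previous ones rather than independently on bootstrap samples.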

Question 2

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

A. Logistic regression
B. Spark ML cannot distribute linear regression training
C. Iterative optimization
D. Least-squares method
E. Singular value decomposition

Answer : C
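
Iterative optimization replaces a closed-form matrix solve with repeated small updates to the coefficients, each of which needs only sums over the data and can therefore be computed in parallel across partitions. The sketch below uses plain batch gradient descent on a single feature to show the idea. It is a stand-in for illustration, not Spark ML's actual solver, and the function name, learning rate, and data are assumptions.

```python
def fit_linear_gd(xs, ys, learning_rate=0.05, n_iters=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        # Each gradient is a sum over rows, so it could be computed per
        # partition and then combined.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data (hypothetical) generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear_gd(xs, ys)
print(round(w, 4), round(b, 4))
```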

Question 3

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.
Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

A. Spark ML decision trees test every feature variable in the splitting algorithm
B. Spark ML decision trees automatically prune overfit trees
C. Spark ML decision trees test more split candidates in the splitting algorithm
D. Spark ML decision trees test a random sample of feature variables in the splitting algorithm
E. Spark ML decision trees test binned feature values as representative split candidates

Answer : E
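
To see why binning can change the resulting tree, compare the thresholds an exact splitter may consider with those left after a continuous feature is discretized into a fixed number of bins (Spark ML controls this with the maxBins parameter). The sketch below is a simplified illustration. The helper names are hypothetical, and the quantile computation is a rough approximation, not Spark's implementation.

```python
def exact_candidates(values):
    """Midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def binned_candidates(values, max_bins):
    """Approximate quantile boundaries: at most max_bins - 1 thresholds."""
    ordered = sorted(values)
    n = len(ordered)
    thresholds = []
    for i in range(1, max_bins):
        value = ordered[(i * n) // max_bins]
        # Skip duplicates so each threshold appears once.
        if not thresholds or value != thresholds[-1]:
            thresholds.append(value)
    return thresholds


# Toy feature column (hypothetical): 100 distinct values.
feature = [float(v) for v in range(100)]
exact = exact_candidates(feature)
binned = binned_candidates(feature, max_bins=4)
print(len(exact), binned)
```

With fewer candidate thresholds, the best available split can differ from the one an exact search would pick, so the two trees can diverge even with identical data and hyperparameters.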

Question 4

A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.
Which of the following approaches can the data scientist use to accomplish this MLflow run organization?

A. They can turn on Databricks Autologging
B. They can specify nested=True when starting the child run for each unique combination of hyperparameter values
C. They can start each child run inside the parent run's indented code block using mlflow.start_run()
D. They can start each child run with the same experiment ID as the parent run
E. They can specify nested=True when starting the parent run for the tuning process

Answer : B

Question 5

Which of the following approaches can be used to view the notebook that was run to create an MLflow run?

A. Open the MLmodel artifact in the MLflow run page
B. Click the “Models” link in the row corresponding to the run in the MLflow experiment page
C. Click the “Source” link in the row corresponding to the run in the MLflow experiment page
D. Click the “Start Time” link in the row corresponding to the run in the MLflow experiment page

Answer : C

Question 6

A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning.
Which of the following steps will the data scientist need to perform outside of their AutoML experiment?

A. Model tuning
B. Model evaluation
C. Model deployment
D. Exploratory data analysis

Answer : C

Question 7

A machine learning engineer has grown tired of needing to install the MLflow Python library on each of their clusters. They ask a senior machine learning engineer how their notebooks can load the MLflow library without installing it each time. The senior machine learning engineer suggests that they use Databricks Runtime for Machine Learning.
Which of the following approaches describes how the machine learning engineer can begin using Databricks Runtime for Machine Learning?

A. They can add a line enabling Databricks Runtime ML in their init script when creating their clusters.
B. They can check the Databricks Runtime ML box when creating their clusters.
C. They can select a Databricks Runtime ML version from the Databricks Runtime Version dropdown when creating their clusters.
D. They can set the runtime-version variable in their Spark session to “ml”.

Answer : C

Question 8

A data scientist is utilizing MLflow Autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-square error (RMSE).
Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id?

A.
B.
C.
D.

Answer : A

Question 9

A machine learning engineer has been notified that a new Staging version of a model registered to the MLflow Model Registry has passed all tests. As a result, the machine learning engineer wants to put this model into production by transitioning it to the Production stage in the Model Registry.
From which of the following pages in Databricks Machine Learning can the machine learning engineer accomplish this task?

A. The home page of the MLflow Model Registry
B. The experiment page in the Experiments observatory
C. The model version page in the MLflow Model Registry
D. The model page in the MLflow Model Registry

Answer : C

Question 10

A machine learning engineer has identified the best run from an MLflow Experiment. They have stored the run ID in the run_id variable and identified the logged model name as "model". They now want to register that model in the MLflow Model Registry with the name "best_model".
Which lines of code can they use to register the model associated with run_id to the MLflow Model Registry?

A. mlflow.register_model(run_id, "best_model")
B. mlflow.register_model(f"runs:/{run_id}/model", "best_model")
C. mlflow.register_model(f"runs:/{run_id}/model")
D. mlflow.register_model(f"runs:/{run_id}/best_model", "model")

Answer : B
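
The first argument to mlflow.register_model is a model URI, and for a model logged in a run it takes the form runs:/<run_id>/<artifact_path>. The second argument is the registry name. The sketch below only builds that URI in plain Python so the two names are not confused. The helper function and the run ID value are hypothetical, and the registration call itself is left as a comment because it needs MLflow and a tracking server.

```python
def runs_model_uri(run_id, artifact_path):
    """Build the URI of a model logged under a run's artifact path."""
    return f"runs:/{run_id}/{artifact_path}"


run_id = "abc123"  # hypothetical placeholder, not a real run ID
model_uri = runs_model_uri(run_id, "model")  # "model" is the logged model name
print(model_uri)

# With MLflow installed and configured, registration would then be:
# mlflow.register_model(model_uri, "best_model")
```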

Question 11

A new data scientist has started working on an existing machine learning project. The project is a scheduled Job that retrains every day. The project currently exists in a Repo in Databricks. The data scientist has been tasked with improving the feature engineering of the pipeline’s preprocessing stage. The data scientist wants to make necessary updates to the code that can be easily adopted into the project without changing what is being run each day.
Which approach should the data scientist take to complete this task?

A. They can create a new branch in Databricks, commit their changes, and push those changes to the Git provider.
B. They can clone the notebooks in the repository into a Databricks Workspace folder and make the necessary changes.
C. They can create a new Git repository, import it into Databricks, and copy and paste the existing code from the original repository before making changes.
D. They can clone the notebooks in the repository into a new Databricks Repo and make the necessary changes.

Answer : A

Question 12

A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.
Which of the following approaches can the team use to identify which task is the cause of the failure?

A. Run each notebook interactively
B. Review the matrix view in the Job’s runs
C. Migrate the Job to a Delta Live Tables pipeline
D. Change each Task’s setting to use a dedicated cluster

Answer : B

Question 13

A data scientist is using Spark SQL to import their data into a machine learning pipeline. Once the data is imported, the data scientist performs machine learning tasks using Spark ML.
Which of the following compute tools is best suited for this use case?

A. Single Node cluster
B. Standard cluster
C. SQL Warehouse
D. None of these compute tools support this task

Answer : B

Question 14

A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df. batch_df has the following schema: customer_id STRING
The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri:

In which situation will the machine learning engineer’s code block perform the desired inference?

A. When the Feature Store feature set was logged with the model at model_uri
B. When all of the features used by the model at model_uri are in a Spark DataFrame in the PySpark
C. When the model at model_uri only uses customer_id as a feature
D. This code block will not perform the desired inference in any situation.
E. When all of the features used by the model at model_uri are in a single Feature Store table

Answer : A

Question 15

Which of the following evaluation metrics is not suitable to evaluate runs in AutoML experiments for regression problems?

A. F1
B. R-squared
C. MAE
D. MSE

Answer : A
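
F1 is computed from precision and recall on class labels, so it has no meaning for continuous predictions. The three regression metrics listed can be computed directly from the true and predicted values, as in the minimal sketch below. The function names and sample values are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (hypothetical).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), r_squared(y_true, y_pred))
```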

Certified Machine Learning Associate v1.0